An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases

Authors

  • Sang-Wook Kim
  • Sanghyun Park
  • Wesley W. Chu
Abstract

This paper discusses effective processing of similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods for processing similarity search that supports time warping fail to employ multi-dimensional indexes without false dismissal, since the time warping distance does not satisfy the triangular inequality. They have to scan the entire database and thus suffer from serious performance degradation in large databases. Another method, which employs the suffix tree and does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to improve search performance in large databases without permitting any false dismissal. To attain this goal, we devise a new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality. D_tw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_tw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we perform extensive experiments. The results reveal that our method achieves significant speedup, up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data. The performance gain becomes larger (1) as the number of data sequences gets larger, (2) as the average length of data sequences gets longer, and (3) as the tolerance in a query gets smaller. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
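
The filter-and-refine idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the 4-tuple or the exact form of the distances, so the code below makes loudly labeled assumptions: the feature vector is taken to be (first, last, greatest, smallest) element of a sequence, the lower bound is the L-infinity distance between two feature vectors, and the time warping distance uses the maximum element-wise difference along the best warping path. All function names are hypothetical, and a plain linear scan stands in for the multi-dimensional index.

```python
def time_warping_distance(s, q):
    """Time warping distance by dynamic programming.

    Assumption: the cost of a warping path is the largest element-wise
    difference along it (an L-infinity base distance).
    """
    n, m = len(s), len(q)
    inf = float("inf")
    # d[i][j] holds the distance between the prefixes s[:i] and q[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = max(abs(s[i - 1] - q[j - 1]), best_prev)
    return d[n][m]


def feature_vector(s):
    """Assumed 4-tuple: first, last, greatest and smallest element.

    Repeating elements (warping) leaves all four values unchanged.
    """
    return (s[0], s[-1], max(s), min(s))


def lower_bound_distance(s, q):
    """Assumed form of the lower bound: L-infinity distance between features."""
    fs, fq = feature_vector(s), feature_vector(q)
    return max(abs(a - b) for a, b in zip(fs, fq))


def range_search(database, query, tolerance):
    """Filter with the cheap lower bound, then refine with the exact distance.

    A linear scan replaces the multi-dimensional index in this sketch.
    """
    candidates = [s for s in database
                  if lower_bound_distance(s, query) <= tolerance]
    return [s for s in candidates
            if time_warping_distance(s, query) <= tolerance]
```

Under these assumptions the filter never discards a true answer: the first elements and the last elements of the two sequences are always aligned, and the greatest (or smallest) element of one sequence must be matched against some element no greater (or no smaller) than the other sequence's extreme, so each feature difference is bounded by the warping distance. In a full system, the 4-tuples would be stored in a multi-dimensional index so that the filter step avoids scanning every sequence.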

Similar resources

A heuristic approach for multi-stage sequence-dependent group scheduling problems

We present several heuristic algorithms based on tabu search for solving the multi-stage sequence-dependent group scheduling (SDGS) problem by considering minimization of makespan as the criterion. As the problem is recognized to be strongly NP-hard, several meta (tabu) search-based solution algorithms are developed to efficiently solve industry-size problem instances. Also, two different initi...


Automatic evaluation of Persian web video search engines based on vote aggregation

Today, the growth of the internet and its strong influence on individuals' lives have led many users to meet their daily needs through search engines; hence, search engines need to be modified and continuously improved. Therefore, evaluating search engines to determine their performance is of paramount importance. In Iran, as in other countries, extensive research is being performed ...


B-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database...


Fingerprinting and genetic diversity evaluation of rice cultivars using Inter Simple Sequence Repeat marker

Rice, as one of the most important agricultural crops, has putative potential for ensuring food security and addressing poverty in the world. In the present study, in order to provide basic information to improve rice through breeding programs, the Inter Simple Sequence Repeat (ISSR) marker was used for DNA fingerprinting and finding genetic relationships among 32 different cultivars. In this study...


The ed-tree: An Index for Large DNA Sequence Databases

The growing interest in genomic research has caused explosive growth in the size of DNA databases, making it increasingly challenging to perform searches on them. In this paper, we propose an index structure called the ed-tree for supporting fast and effective homology searches on DNA databases. The ed-tree is developed to enable probe-based homology search algorithms like Blastn which generat...


Publication date: 2001